Re-evaluating evaluation

Balduzzi, David, Tuyls, Karl, Perolat, Julien, Graepel, Thore

Neural Information Processing Systems

Progress in machine learning is measured by careful evaluation on problems of outstanding common interest. However, the proliferation of benchmark suites and environments, adversarial attacks, and other complications has diluted the basic evaluation model by overwhelming researchers with choices. Deliberate or accidental cherry picking is increasingly likely, and designing well-balanced evaluation suites requires increasing effort. In this paper we take a step back and propose Nash averaging. The approach builds on a detailed analysis of the algebraic structure of evaluation in two basic scenarios: agent-vs-agent and agent-vs-task. The key strength of Nash averaging is that it automatically adapts to redundancies in evaluation data, so that results are not biased by the incorporation of easy tasks or weak agents. Nash averaging thus encourages maximally inclusive evaluation -- since there is no harm (computational cost aside) in including all available tasks and agents.
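
To make the agent-vs-agent case concrete, here is a minimal sketch (not code from the paper; the function name nash_average and the toy matrix are illustrative assumptions): given an antisymmetric payoff matrix A, where A[i, j] is the average advantage of agent i over agent j, Nash averaging treats A as a two-player zero-sum meta-game, computes a Nash equilibrium mixture over agents, and rates each agent by its expected payoff against that mixture.

```python
import numpy as np
from scipy.optimize import linprog

def nash_average(A):
    """Nash averaging sketch for the agent-vs-agent case.

    A is an antisymmetric payoff matrix: A[i, j] > 0 means agent i beats
    agent j on average. Returns a Nash equilibrium mixture p over agents
    and each agent's expected payoff against that mixture. Caveat: this LP
    returns *a* Nash equilibrium; the paper selects the maximum-entropy
    one, an extra step omitted here.
    """
    n = A.shape[0]
    # Variables x = (p_1, ..., p_n, v): maximise the game value v subject
    # to (p^T A)_j >= v for every column j, with p a probability vector.
    c = np.zeros(n + 1)
    c[-1] = -1.0                                  # linprog minimises, so minimise -v
    A_ub = np.hstack([-A.T, np.ones((n, 1))])     # v - (p^T A)_j <= 0
    b_ub = np.zeros(n)
    A_eq = np.append(np.ones(n), 0.0).reshape(1, -1)  # sum_i p_i = 1
    b_eq = [1.0]
    bounds = [(0, None)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    p = res.x[:n]
    return p, A @ p

# Toy meta-game: agents 0 and 1 are redundant copies; agent 2 beats both.
A = np.array([[0.0, 0.0, -1.0],
              [0.0, 0.0, -1.0],
              [1.0, 1.0,  0.0]])
p, ratings = nash_average(A)
print(p)        # all mass on agent 2; duplicating agent 0 changes nothing
print(ratings)  # [-1. -1.  0.]
```

The toy example shows the adaptivity-to-redundancy property claimed in the abstract: because agents 0 and 1 are copies, adding or removing one of them does not shift any Nash mass or rating.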


Reviews: Re-evaluating evaluation

Neural Information Processing Systems

This paper applies game theory (Nash equilibria, though two-player zero-sum games suffice for most of the arguments) to the problem of evaluating a set of agents on a slate of tasks (and/or against other agents). The evaluation methods presented have the important property of being robust to the non-systematic addition of more agents and tasks. The paper casts the problem in a sound mathematical framework, leveraging the Hodge decomposition of skew-symmetric matrices and its generalization in combinatorial Hodge theory. This allows for a very illuminating and general view of the structure of evaluation, leading to a generalization of the Elo rating that eschews the transitivity assumption by embedding ratings in more dimensions, as is required in general, and to a "Nash averaging" method that I think will be the more lasting contribution. The paper's position on evaluation is also increasingly important at present, in my opinion. There are strong connections to the dueling variant of the contextual bandit setting, in which feedback is limited to preferences among pairs of actions that the learner chooses on the fly ([1] and related work).
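
The first step of the decomposition the review refers to is easy to demonstrate. The sketch below (the helper hodge_split is a hypothetical name, not code from the paper) splits an antisymmetric evaluation matrix into a transitive component generated by one scalar rating per agent, which is exactly what an Elo-style rating can represent, and a divergence-free cyclic remainder that no one-dimensional rating can capture.

```python
import numpy as np

def hodge_split(A):
    """Split an antisymmetric evaluation matrix into a transitive part,
    grad(r)[i, j] = r[i] - r[j], generated by scalar ratings r (the
    Elo-like component), and a divergence-free cyclic remainder."""
    r = A.mean(axis=1)                    # rating = average payoff of each agent
    transitive = r[:, None] - r[None, :]  # grad(r)
    cyclic = A - transitive               # rows of this average to zero
    return r, transitive, cyclic

# Rock-paper-scissors is purely cyclic: one-dimensional ratings see nothing.
A = np.array([[ 0.0,  1.0, -1.0],
              [-1.0,  0.0,  1.0],
              [ 1.0, -1.0,  0.0]])
r, T, C = hodge_split(A)
print(r)                  # [0. 0. 0.]
print(np.allclose(C, A))  # True: the whole game lives in the cyclic component
```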

